Prosper, or Prosper Marketplace, is a leader in the online peer-to-peer lending industry. Borrowers create profiles and listings (request loans), investors either individuals or institutions, view the listing (borrower’s loan request) and decide how much to lend the borrower towards the loan.
This Prosper Loan data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
There are 3 stakeholder in this business model – borrower, investor and the company Prosper. From the business perspective, satisfying need of target customer is the most important thing . This project will focuse in customer analysis and explore profit-driven loan for borrower and lender and conduct time study to see loans changes over the year. Firstly, it will explore the classification of target customer - borrower and lender, respectively. Secondly, for borrower, the analysis will explore which group of borrower has higher chance to have the changed off debt. Thirdly,the analysis is trying to figure out what kind of investment(combined factors) is the most popular for investor.
In the begining of the project, I want to use univariate plots to explore the overall statistic for Prosper loans and its customer distribution. At first, let’s check the overall distribution of Prosper ratings listing to understand more of loan characteristics.
nrow(loan) # check number of rows
## [1] 113937
ncol(loan) #check number of columns
## [1] 81
names(loan) #Check the name of the dataset
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
Loan Prosper Rating is the fisrt factor I want to check
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 16965 58308 7433 7189 2395 756 2572 10494 199 85 91 217
## 12 13 14 15 16 17 18 19 20
## 59 1996 876 1522 304 52 885 768 771
From Loan Prosper Rating histogram, we can see there is near 30,000 unrated listing. Also, Rating C loans has highest number among all the listing, which is close to 18,000 listings.B, C rating Loan are top 2 and to 3 popular ratings. Also,from Listing Category Frequency Histogram, we can also learn that Debt Consolidation is Top1 category of all the listing category and has 58308, which is about 50% of total number of loans.Let’s also plot a time study histogry to see the number of loans.
## Q1 2006 Q1 2007 Q1 2008 Q1 2010 Q1 2011 Q1 2012 Q1 2013 Q1 2014 Q2 2006
## 315 3079 3074 1243 1744 4435 3616 12172 1254
## Q2 2007 Q2 2008 Q2 2009 Q2 2010 Q2 2011 Q2 2012 Q2 2013 Q3 2006 Q3 2007
## 3118 4344 13 1539 2478 5061 7099 1934 2671
## Q3 2008 Q3 2009 Q3 2010 Q3 2011 Q3 2012 Q3 2013 Q4 2005 Q4 2006 Q4 2007
## 3602 585 1270 3093 5632 9180 22 2403 2592
## Q4 2008 Q4 2009 Q4 2010 Q4 2011 Q4 2012 Q4 2013
## 532 1449 1600 3913 4425 14450
Year variable needed to be extracted and transfomed from quater variable at first and plot histogram using both time varaibles. From these two graphs, we can see the number of total Prosper loan has ups and downs trend from 2006 to 2014. It went up from 2006 to 2008 and dropped heavily in 2009, which may be caused by financial crisis in 2008, then it increased again until 2014 Q1, 2014 only has 1 quarter data, which can not represent whole year. Overall, the total number of loan are increasing over the years.
Now I have seen some loan information, I want to know the cluster customer by Customer Risk Score to its distribution.
From the chart, we can conclude the customers’ risk score is normally distributed and about 70% customers’ risk score are between 3.5 and 8.5.From Borrow State Frequency chart, we can see CA- California has the highest number of loans, which is more than twice than the secondest highest state -Texas. This makes sense as Prosper’s was founded in California and it should has the largest base customer. After exploring borrower data, let’s check investor related variable - number of investor per loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
The max number of investors per loan is 1189, while density graph shows that 95% of loan’s number of investor are in the range of (0,290). The median number of investors are 44. This variable should be investigated later to see which loan attracts investors the most. Another factors that is important for investor is return rate, let’s check the return rate distribution using bar chart. Since EstimatedReturn has a lot of missing data, I need to remove them at first.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.074 0.092 0.096 0.117 0.284 29084
The avereage return rate is 9.6% overall, which is about the center of this frequency ploygen charts. There is one outler about return rate 13%, which I want to explore later. Also, Loan Amount distribution will be a interesting metric. At first, i see a very long tailed histogram and I decided to use 90% quantile to see the majority of loan Amount lies.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Obviouslly, 90% of loan amount are between 1000 and 15000, and the max amount is $35,000, the median amount is $6500. Now we know most people has median risk score, the 3rd question I will be interested is how long will people usually choose for loan? Let’s answer this question with a histogram. Since the terms is months unit, I will tranform it into year)
We can see from the result that about 3 years terms is most popular option to pay the loan.So far, we got a general idea of the prosper loan product, let’s check the overall statistic for Loan status to see how well people pay their loan.
From this fipped histogram, we can see about 33% of loan are completed, almost 50% loan are still in program, while about 10% loan are changed off. a low charged off loan rate is a indicator of good loan product. To help prosper succeed, I want to figure out which factors lead to higher chance of loan being charged off. I will use scatter plot to explore the coefficient between variable and loan. Since we explore terms of years and loan status in Univariate Analysis, I want to limit the plot to loan status ‘Completed’ and ‘Chargedoff’ by terms to see how differnt terms of loan varies for these two categories.
From above two bar talbe, we can see most terms in completed status are 3 years terms, in non_completed status, there is higher percentage of 5 years terms, this makes sense to the data.
This dataset contains 81 variable, these variable can be generally divided as 4 groups - loan/Listing variables, borrower variable, LP-Prosper Payment related variable, and time variabls. This data contains continuous, binary, nominal and discrete data. I will use different plot for different data type.
I am interesting in which customer group have a higher probablity of paying loan on time and which customer groups have a higher chance of loans being charged off.Secondly, I am also curious to see which type of loan factor affects investor’s the most, so that Prosper can benefit and make strategy of selling different loans.lastly, I also want to investigate the trend of customer/Profit change over the year to validate the past business strategy is working.
I think borrower geographical and demographical information will be very helpful for investigating the customer group. Also, EstimatedReturn rate may be a good measurement for a good lender and I can explore the correlation between EstimatedReturn rate and Furthermore, columns with prefix ‘LP_’ can be used to calculate the profit Prosper gain from each listing.
Yes, I created year varibale form LoanOriginationQuarter and percentage variable by Customer risk score and by loan status
Yes, I transformed term into year unit to have a better understanding of length of loan terms since terms in months is very scatterred. After transforming, it is clear to see 3 years’ term is the most popular option for borrower.
Additionally,let’s check the average borrower LoanOriginalAmount per loan status to see if high risk borrower have higher percentage of charged off loans. , there are a lot N/A value among completed, current and charged off listing status. Therefore, before computing the average value, N/A value is removed.
## # A tibble: 12 x 3
## LoanStatus Avg_ProsperScore n
## <fct> <dbl> <int>
## 1 Past Due (>120 days) 4.75 16
## 2 Past Due (16-30 days) 4.97 265
## 3 Past Due (61-90 days) 5.01 313
## 4 Past Due (1-15 days) 5.04 806
## 5 Past Due (31-60 days) 5.11 363
## 6 Past Due (91-120 days) 5.22 304
## 7 Chargedoff 5.39 11992
## 8 Defaulted 5.62 5018
## 9 FinalPaymentInProgress 5.75 205
## 10 Current 5.84 56576
## 11 Completed 6.54 38074
## 12 Cancelled NaN 5
From the bar chart, we can see Completed group of borrower has the highest average Prosper Score, which means they have lowest risk. apparantly, prosper score is a relaible indicator of good loan candidate.
The higher LoanOriginalAmount means higher profit contribution of borrower, which can help Prosper make suitable marketing plan. On the other hand, years of employment can be a good meansurement to find the target custmor
## # A tibble: 6 x 4
## experience avgAmt sumAmt n
## <dbl> <dbl> <int> <int>
## 1 45.7 25000 50000 2
## 2 47.0 25000 25000 1
## 3 46.2 18920 37840 2
## 4 36.8 17889 161000 9
## 5 41.7 15000 15000 1
## 6 44.1 15000 30000 2
The Scatter Plot of loan Amount vs Years of Employment Experience line shows no relationship between oan Amount and Years of Employment. Let’s try to use average LoanOriginalAmount by experience of employment at first From Average Loan Amount by Years of Employment Experience Line Chart, we can see the years of Employment Experience between 45 and 50 has the highest average LoanOriginalAmount, which is outlier, much more than other experience group. This group have high demand of loan amount per person which can be further explored.
## [1] 0.3389275
From Scatter Plot of Total Loan Amount vs Year, we cannot see proper rating system between 2006 to 2008 because the missing value form prosper rating column have been removed, which means from 2006 to 2008, there is no prosper rating systerm. Plus,the total loan amount is increasing from 2010 to 2012. large amount of 2013 loan and 2014’s loans have not been rated yet.
Also, from scatter plot of Loan Amount vs Term in year has positive correlation,their coefficient correlation is 0.3389, which is moderate level. Another question I have in mind is EstimatedReturn rate’s relationship with number of investors. Does higher return rate attract more investor? Let’s find out in scatter plot.
The estimated return and number of investor are not totally positively correlated. The range of estimated return between 5% and 17% attract the majority of investor, the rate between 4% and above 18% attract compararty small investor. This can be investigated in next section. Additionally, I also want to see the EstimatedReturn rate’s relationship with ProsperRating to see if they have strong correlation using boxplot. Do higher prosper rating have higher return rate?
From the boxplot, we can tell the highest risk rating HR result in second highest average return rate,while it has long range of return rate from -20% to 14%. Also, from AA -E, the average return rate goes up even the rating goes down. ProsperRating and return rate seem have weak negative correlation.
the dataset? From Average Prosper Score vs Loan Status chart, we can see apparantly people who has completed loan carries highest prosper score among all category. This also proves the effectiveness and accurancy of prosper score.
from the Employment Experience vs LoanOriginalAmount charts, we can conclude that group ‘between 45 and 50’ has the highest average LoanOriginalAmount, which is outlier. This is very interesting because it shows this age group people likes to borrow more debts from Prosper.
Among, above analysis, LoanOriginalAmount and Term in year has positive correlation,their coefficient correlation is 0.3389, which is moderate level.
As I stated in univariate analysis summary, geographical information can be a good part to segment borrower. In this section, 1. I will analyze borrowers’ data from geographical and demographical factor. 2. I will further explore EstimatedReturn and chargedoff loss for investor’ relationship with loan and borrower’s factor. 3. I will conduct a time study for overall proper loan and investor’s preference for loans.
This result is really intriguing, almost all the state’s completed loan borrower have higher prosper score except MT- Montana. Motana’s chargedoff loan borrower has about 6.8 Prosper score , very high compared to its completed loan borrower score 5.5. Montana is very unique case.
We can see from the graph that people who are homeowner will have higher loan than people who does not own a home in all state. IsBorrowerHomeowner is definitely a motivator for people to loan. additionaly, DC, CA and HI are top 3 states who have highest Medium Loan Original Amount. DC borrower loans much higher than other states in the chart.
The Estimated Return rate for different employment type of borrower does not vary much, but they do have different preference for different category of loan.
The investment always have gains and loss. Let’s further investigate the factors that influence borrowers LP_GrossPrincipalLoss because this will affect the rating of loan and evaluate which loan will investor loss the most money on.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -94.2 0.0 0.0 700.4 0.0 25000.0
Surprisingly, Low risk loan have higher total amount of loss, this is probably due to low risk loan has more investment. Also, high income group ($75,000 -99,999) has highest percentage in HR High risk rating loan. apearantly, the higher the income,the more risk people are willing to take.
We can see that HR rating loan has the widest range of estimated return, from -20% to 18%. Rating A, AA low risk loan don’t have negative return rate.
Before ploting, I remove 2005 and 2014 year since they only have one quater data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
Before 2009, Prosper did not have prosper rating system for their loan, therefore, I remove years before 2009 and 2014 and remove missing value in prosper rating.The total loan amount and number of investors is increasing from 2006 to 2012, which is good sign that Prosper’s business is growing. However, compared to 2012, 2013’s number is very low. In bivariate analysis, I plot year vs total loan amount, the 2013 number is higher than 2012, which may mean a large number of 2013’s loan have not been rated yet.
Let’s also check loan return rate vs vs Proper rating over years.
The graphy shows that from 2010 to 2013, the average estimated return rate is acutually decreasing for all rating of loans.
1.From 1.2 charts of Median LoanOriginalAmount vs IsBorrowerHomeowner by state shows that people who are homeowner will have higher loan than people who does not own a home in all state. IsBorrowerHomeowner is definitely a motivator for people to loan. additionaly, DC, CA and HI are top 3 states who have highest Medium Loan Original Amount. DC borrower loans much higher than other states in the chart and top natinalwide.
In bar chart of Prosper Score by Loan Status -(Completed, charged off) by state, all the state’s completed loan borrower have higher average prosper score except MT- Montana. Motana’s chargedoff loan borrower has about 6.8 Prosper score, very high compared to its completed loan borrower score 5.5.
furthermore, from chart of GrossPrincipalLoss vs ProsperRating by Income Range, surprising, Low risk loan have higher total amount of loss, this is probably due to low risk loan has more investment. Also, high income group($75,000 -99,999) has highest percentage in HR High risk rating loan. apearantly, the higher the income,the more risk people are willing to take.
The graph shows that people who are homeowner will have higher loan than people who does not own a home in all state. IsBorrowerHomeowner is definitely a motivator for people to loan. additionaly, DC, CA and HI are top 3 states who have highest Medium Loan Original Amount. DC borrowers’ average loans is about $11,000, which is much higher than other states nationalwide. Hawaii borrower has the highest percentage of not owning home among all the states.
The graphs show that low risk loan have higher total amount of loss, compared to total loan amount per rating group. Also, high income group($75,000 -99,999) has highest percentage in HR High risk rating loan. Therefore, we can conclude the higher the income,the more risk people are willing to take. In addition, we can see that HR rating loan, which has the lowest total loan amount, has the widest range of estimated return, from -20% to 18%. Rating A, AA low risk loan don’t have negative return rate. Rating Loan D and E have the highest average return rate.
Before 2009, Prosper did not have prosper rating system for their loan, therefore, I remove years before 2009 and 2014 and remove missing value in prosper rating.The total loan amount and number of investors is increasing from 2006 to 2012, which is good sign that Prosper’s business is growing. However, I noticed, compared to 2012, 2013’s number is very low. In bivariate analysis, I plot year vs total loan amount, the 2013 number is higher than 2012, which may mean a large number of 2013’s loan have not been rated yet since the latest quater is Q1 2014. Also, from these two charts, we can see prosper rating D and C loan are most popular products For investor and borrowers in Prosper in 2012, especially for borrower, these two loans are the most attractive one. While loan E’s percentage of investors’ number is decreasing from 2010 to 2012.
The biggest struggle I faced in the begining of this project is to find the most interesting question to ask and select useful variables from these 81 variables.
In order to come up a good business question to ask, I did reseach on prosper.com to understand its business model,read data disctionary several times to understand the meaning of each columns and browse blog of this dataset online. Then I conduct summary() function to check the value of each value to understand the data structure better, and I believe data analysis is designed to help business answer question,and the 1st thing about business is customer-focused and product focused. therefore, I designed the project to perform customer analysis and explore profit-driven loan for borrower and lender and conduct time study to see loans changes over the year. 1. Explore the classification of target customer - borrower and lender, respectively. 2. For borrower, the analysis will explore which group of borrower has higher chance to have the changed off debt. 3. Identify what kind of investment(combined factors) is the most popular for investor.
I succesfully divided data variable into different groups -borrower, loan, investors,and LP payment, and created all kinds of charts using ggplot2. Firstly,I created plot to show Average Prosper Score by Loan Status by state, and noticed that the better risk score and the higher completion rate.I also find out the terms and Loan amount have moderate positive correlation and home ower will more likely to request higher amount loan. For loan product side, I discovered that product B and C are most popular loan which attracts the most of investors.
Furthermore, a surprisingly interesting pattern that the lower risk loan is, the higher risk of being charged off. Also, another surprising discovery is from the Employment Experience vs LoanOriginalAmount charts, which shows group ‘between 45 and 50’ has the highest average LoanOriginalAmount, which is outlier. This is very interesting because it shows this age group people likes to borrow more debts from Prosper.
For the future improvement of data anlaysis, I will consider to discover more variable with strong/weak coefficient correlation between return rate of loan, and build a linear/logistic regression model to predict return rate for more loan product.Statistical tests, for example ttest,could be used to further explore this dataset.For instanct, ttest can be conducted using two samples from the same rating loan from 2009 and 2012 to infer the whole population to determine whether the average loan amount for loans in 2009 and 2012 are different or not.